Preface



This volume contains the Proceedings of the 7th International Conference on Text, Speech 
and Dialogue, held in Brno, Czech Republic, in September 2004, under the auspices of
Masaryk University.

This series of international conferences on text, speech and dialogue has come to
constitute a major forum for presentation and discussion, not only of the latest developments in
academic research in these fields, but also of practical and industrial applications. Uniquely,
these conferences bring together researchers from a very wide area, both intellectually and
geographically, including scientists working in speech technology, dialogue systems, text
processing, lexicography, and other related fields. In recent years the conference has
developed into a primary meeting place for speech and language technologists from many different
parts of the world, and in particular it has enabled important and fruitful exchanges of ideas
between Western and Eastern Europe.

TSD 2004 offered a rich program of invited talks, tutorials, technical papers and poster 
sessions, as well as workshops and system demonstrations. A total of 78 papers were accepted 
out of 127 submitted, contributed altogether by 190 authors from 26 countries. Our thanks 
as usual go to the Program Committee members and to the external reviewers for their 
conscientious and diligent assessment of submissions, and to the authors themselves for 
their high-quality contributions. We would also like to take this opportunity to express our 
appreciation to all the members of the Organizing Committee for their tireless efforts in 
organizing the conference and ensuring its smooth running. In particular, we would like to 
mention the work of the Chair of the Program Committee, Hynek Hermansky. In addition we 
would like to thank some other people, whose efforts were less visible during the conference 
proper, but whose contributions were of crucial importance. Dagmar Janouskova and Dana 
Komarkova took care of the administrative burden with great efficiency and contributed 
substantially to the detailed preparation of the conference. The TeXpertise of Petr Sojka
resulted in the extremely speedy and efficient production of the volume which you are 
now holding in your hands, including preparation of the subject index, for which he took 
responsibility. Last but not least, the cooperation of Springer-Verlag as the publisher of these
proceedings is gratefully acknowledged.

Invited Papers



Speech and Language Processing: 

Can We Use the Past to Predict the Future? 



Kenneth Church 



Microsoft, Redmond WA 98052, USA, 

Email: church@microsoft.com

WWW home page: http://research.microsoft.com/users/church/



Abstract. Where have we been and where are we going? Three types of answers 
will be discussed: consistent progress, oscillations and discontinuities. Moore’s Law 
provides a convincing demonstration of consistent progress, when it applies. Speech 
recognition error rates are declining by 10x per decade; speech coding rates are
declining by 2x per decade. Unfortunately, fields do not always move in consistent 
directions. Empiricism dominated the field in the 1950s, and was revived again in 
the 1990s. Oscillations between Empiricism and Rationalism may be inevitable, with 
the next revival of Rationalism coming in the 2010s, assuming a 40-year cycle. 
Discontinuities are a third logical possibility. From time to time, there will be 
fundamental changes that invalidate fundamental assumptions. As petabytes become 
a commodity (in the 2010s), old apps like data entry (dictation) will be replaced with 
new priorities like data consumption (search). 



1 Introduction 

Where have we been and where are we going? Funding agencies are particularly interested 
in coming up with good answers to this question, but we should all prepare our own answers 
for our own reasons. Three types of answers to this question will be discussed: consistent 
progress, oscillations and discontinuities. 

Moore’s Law [11] provides a convincing demonstration of consistent progress, when it 
applies. Speech recognition error rates are declining by 10x per decade; speech coding rates
are declining by 2x per decade. 

Unfortunately, fields do not always move in consistent directions. Empiricism dominated 
the field in the 1950s, and was revived again in the 1990s. Oscillations between Empiricism 
and Rationalism may be inevitable, with the next revival of Rationalism coming in the 2010s, 
assuming a 40-year cycle. 

Discontinuities are a third logical possibility. From time to time, there will be
fundamental changes that invalidate fundamental assumptions. As petabytes become a commodity (in
the 2010s), old apps like data entry (dictation) will be replaced with new priorities like data
consumption (search).

2 Consistent Progress 

There have been a number of common tasks (bake-offs) in speech, language and information 
retrieval over the past few decades. This method of demonstrating consistent progress over
time was controversial when Charles Wayne of DARPA was advocating the approach in the
1980s, but it is now so well established that it is difficult to publish a paper that does not 
include an evaluation on a standard test set. Nevertheless, there is still some grumbling in the 
halls, though much of this grumbling has been driven underground. 

The benefits of bake-offs are similar to the risks. On the plus side, bake-offs help establish 
agreement on what to do. The common task framework limits endless discussion. And it 
helps sell the field, which was the main motivation for why the funding agencies pushed for 
the common task framework in the first place. 

Speech and language have always struggled with how to manage expectations. So much
has been promised at various points that some disappointment was inevitable
when some of these expectations remained unfulfilled.

On the negative side, there is so much agreement on what to do that all our eggs are in 
one basket. It might be wise to hedge the risk that we are all working on the same wrong 
problems by embracing more diversity. Limiting endless discussion can be a benefit, but it 
also creates a risk. The common task framework makes it hard to change course. Finally, 
the evaluation methodology could become so burdensome that people would find other ways 
to make progress. The burdensome methodology is one of the reasons often given for the 
demise of 1950-style empiricism. 

2.1 Bob Lucky’s Hockey Stick Business Case 

It is interesting to contrast Charles Wayne’s emphasis on objective evaluations driving 
consistent progress with Bob Lucky’s Hockey Stick Business Case. The Hockey Stick isn’t 
serious. It is intended to poke fun at excessive optimism, which is all too common and 
understandable, but undesirable (and dangerous). 

The Hockey Stick business case plots time along the x-axis and success ($) along the 
y-axis. The business case is flat for 2003 and 2004. That is, we didn’t have much success in 
2003, and we aren’t having much success in 2004. That’s ok; that’s all part of the business 
case. The plan is that business will take off in 2005. Next year, things are going to be great! 

An “improvement” is to re-label the x-axis with the indexicals, “last year,” “this year,” 
and “next year.” That way, we will never have to update the business case. Next year, when 
business continues as it has always been (flat), we don’t have to worry, because the business 
case tells us that things are going to be great the following year. 

2.2 Moore’s Law 

Moore’s Law provides an ideal answer to the question: where have we been and where are 
we going. Unlike Bob Lucky’s Hockey Stick, Moore’s Law uses past performance to predict 
future capability in a convincing way. Ideally, we would like to come up with Moore’s Law 
type arguments for speech and language, demonstrating consistent progress over decades. 

Gordon Moore, a founder of Intel, originally formulated his famous law in 1965,
http://www.intel.com/research/silicon/mooreslaw.htm [11], based on
observing the rate of progress in chip densities. People were finding ways to put twice as much
stuff on a chip every 18 months. Thus, every 18 months, you get twice as much for half as
much. Such a deal. It doesn’t get any better than that!

We have grown accustomed to exponential improvements in the computer field. For as
long as we can remember, everything (disk, memory, CPU) has been getting better and better
and cheaper and cheaper. However, not everything has been getting better and cheaper at 
exactly the same rate. Some things take a year to double in performance while other things 
take a decade. I will use the term hyper-inflation to refer to the steeper slopes and normal 
inflation to refer to the gentler slopes. Normal inflation is what we are all used to; if you put 
your money in the bank, you expect to have twice as much in a decade. We normally think 
of Moore’s Law as a good thing and inflation as a bad thing, but actually, Moore’s Law and 
inflation aren’t all that different from one another. 
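
The relation between a doubling time and a per-decade factor can be made concrete with a small calculation. The following is only a sketch of that arithmetic; the function names are illustrative and do not come from any cited source.

```python
import math


def factor_per_decade(doubling_time_years: float) -> float:
    """Improvement factor accumulated over ten years for a given doubling time."""
    return 2.0 ** (10.0 / doubling_time_years)


def doubling_time_years(decade_factor: float) -> float:
    """Doubling (or halving) time in years implied by a per-decade factor."""
    return 10.0 * math.log(2.0) / math.log(decade_factor)


if __name__ == "__main__":
    for label, years in [("18 months", 1.5), ("1 year", 1.0), ("10 years", 10.0)]:
        print(f"doubling every {label}: about {factor_per_decade(years):.0f}x per decade")
    for label, factor in [("2x per decade", 2.0), ("10x per decade", 10.0)]:
        print(f"{label}: doubling or halving every {doubling_time_years(factor):.1f} years")
```

On this arithmetic, doubling every 18 months compounds to roughly 100x per decade, while a 10x per decade trend corresponds to a doubling or halving about every three years.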

Why different slopes? Why do some things get better faster than others? In some
cases, progress is limited by physics. For example, the performance of disk seeks doubles every
decade (normal inflation), relatively slowly compared to disk capacities, which double every
year (hyper-inflation). Disk seeks are limited by the physical mechanics of moving disk
heads from one place to another, a problem that is fundamentally hard. 

In other cases, progress is limited by investment. PCs, for example, improved faster than 
supercomputers (Cray computers). The PC market was larger than the supercomputer market, 
and therefore, PCs had larger budgets for R&D. Danny Hillis [7], a founder of Thinking 
Machines, a start-up company in the late 1980s that created a parallel supercomputer, coined
the term, “dis-economy of scale.” Danny realized that computing was better in every way 
(price & performance) on smaller computers. This is not only true for computers (PCs are 
better than big iron), but it is also true for routers. Routers for LANs have been tracking 
Moore’s Law better than big 5ESS telephone switches. 

It turns out that economies of scale depend on the size of the market, not on the size of 
the machine. From an economist’s point of view, PCs are bigger than big iron and routers for 
small computers are bigger than switches for big telephone networks. This may seem ironic 
to a computer scientist who thinks of PCs as small, and big iron as big. In fact, Moore’s Law 
applies better to bigger markets than to smaller markets. 



2.3 Examples of Moore’s Law in Speech and Language 

Moore’s Law provides a convincing demonstration of consistent progress, when it applies. 
Speech coding rates are declining by 2x per decade; recognition error rates are declining by
10x per decade.

Figure 1 shows improvements in speech coding over twenty years [6]. The picture is 
somewhat more complicated than Moore’s Law. Performance is not just a single dimension; 
in addition to bit rate, there are a number of other dimensions that matter: quality, complexity, 
latency, etc. In addition, there is a quality ceiling imposed by the telephone standards. It is 
easy to reach the ceiling at high bit rates (> 8 kb/s). There is more room for improvement at 
lower bit rates. 

Despite these complexities, Figure 1 shows consistent progress over decades. Bit rates are
declining by 2x per decade. This improvement is relatively slow by Moore’s Law standards
(normal inflation). Progress appears to be limited more by physics than investment.

Figure 2 shows improvements in speech recognition over 15 years [9]. Word error rates
are declining by 10x per decade. Progress is limited more by R&D investment than by
physics.

Fig. 1. Speech coding rates are declining by 2x per decade [6].



Note that speech consumes more disk space than text, probably for fundamental reasons. 
Using current coding technology, speech consumes about 2 kb/s, whereas text is closer to 
2 bits per character. Assuming a second of speech corresponds to about 10 characters, speech
consumes 10^2 times more bits than text. Given that speech coding is not improving too rapidly
(normal inflation as opposed to hyper-inflation), the gap between speech bit rates and text bit
rates will not change very much for some time. 
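
The factor of 10^2 follows directly from the figures quoted above. The sketch below reproduces it and also estimates how long the gap would take to close if speech coding kept improving by 2x per decade. That second calculation assumes the text rate stays fixed, which is my simplification for illustration and not a claim made above.

```python
import math

SPEECH_BITS_PER_SECOND = 2_000   # "about 2 kb/s" with current coding technology
TEXT_BITS_PER_CHAR = 2           # "closer to 2 bits per character"
CHARS_PER_SECOND_OF_SPEECH = 10  # "a second of speech corresponds to about 10 characters"


def speech_to_text_ratio() -> float:
    """How many times more bits speech needs than the equivalent text."""
    text_bits_per_second = TEXT_BITS_PER_CHAR * CHARS_PER_SECOND_OF_SPEECH
    return SPEECH_BITS_PER_SECOND / text_bits_per_second


def years_to_close_gap(ratio: float, coding_gain_per_decade: float = 2.0) -> float:
    """Years for speech coding to catch up, assuming (hypothetically) a fixed text rate."""
    return 10.0 * math.log(ratio) / math.log(coding_gain_per_decade)


if __name__ == "__main__":
    ratio = speech_to_text_ratio()
    print(f"speech/text bit ratio: {ratio:.0f}")
    print(f"years to close the gap at 2x per decade: {years_to_close_gap(ratio):.0f}")
```

Under these assumptions the gap would take more than six decades to close, which is consistent with the remark that it will not change very much for some time.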



2.4 Milestones and Roadmaps 

Figure 3 lists a number of milestones in speech technology over the past forty years. This 
figure answers the question, where have we been, but says relatively little (compared to 
Moore’s Law) about where are we going. The problem is that it is hard to extrapolate (predict 
future improvements). 

Table 1 could be used as the second half of Figure 3. This table was extracted from an
Elsnet Roadmap meeting [3].

These kinds of roadmaps and milestones are exposed to the Hockey Stick argument.
When the community is asked to predict the future, there is a natural tendency to get carried 
away and raise expectations unrealistically. 

Fig. 2. Speech recognition error rates are declining by 10x per decade [9].

Table 1. Elsnet Roadmap (2000): Selected Milestones

Year  Milestone
2003  Useful speech recognition-based language tutor
2003  Useful portable spoken sentence translation systems
2003  First pro-active spoken dialogue with situation awareness
2004  Satisfactory spoken car navigation systems
2005  Small-vocabulary (> 1000 words) spoken conversational systems
2006  Multiple-purpose personal assistants
2006  Task-oriented spoken translation systems for the web
2006  Useful speech summarization systems in top languages
2008  Useful meeting summarization systems
2010  Medium-size vocabulary conversational systems

At a recent IEEE conference, ASRU-2003, Roger Moore (who is not related
to Gordon Moore) compared a 1997 survey of the attendees with a 2003 survey
(http://www.elsnet.org/dox/moore-asru.pdf). The 2003 survey asked the
community when twenty milestones would be achieved, a dozen of which were borrowed from
the 1997 survey, including:

1. More than 50% of new PCs have dictation on them, either at purchase or shortly after.

2. Most telephone Interactive Voice Response (IVR) systems accept speech input.

3. Automatic airline reservation by voice over the telephone is the norm.

4. TV closed-captioning (subtitling) is automatic and pervasive.

5. Telephones are answered by an intelligent answering machine that converses with the
calling party to determine the nature and priority of the call.

6. Public proceedings (e.g., courts, public inquiries, parliament, etc.) are transcribed
automatically.

Fig. 3. Milestones in Speech Technology over the last forty years [13].

Ideally, it should be clear whether or not a milestone has been achieved. In this respect,
these milestones are better than the ones mentioned in Table 1.

Roger Moore’s most interesting finding, which he called the “Church effect,” is that the 
community had pushed the dates out 6 years over the 6 years between the two surveys. Thus, 
on average, the responses to the 2003 survey were the same as those in 1997, except that after 
6 years of hard work, we have apparently made no progress, at least by this measurement. 

The milestone approach to roadmapping inevitably runs the risk of raising expectations 
unrealistically. The Moore’s Law-approach of extrapolating into the future based on objective 
measurements of past performance produces more credible estimates, with less chance of a 
Hockey Stick or a “Church effect.” 

2.5 Summary of Consistent Progress 

Although it is hard to make predictions (especially about the future), Moore’s Law provides 
one of the more convincing answers to the question: where have we been and where are 
we going. Moore’s Law is usually applied to computer technology (memory, CPU, disk), but 
there are a few examples in speech and language. Speech recognition error rates are declining
by 10x per decade; speech coding rates are declining by 2x per decade.

Some other less convincing answers were presented. A timeline can tell us where we 
have been, but does not support extrapolation into the future. One can survey the experts
in the field on when they think various milestones will be achieved, but such surveys can
introduce hockey sticks. It is natural to believe that great things are just around the corner. 

Moore’s Law not only helps us measure the rate of progress and manage expectations, but 
it also gives us some insights into the mechanisms behind key bottlenecks. It was suggested 
that some applications are constrained by physics (e.g., disk seek, speech coding) whereas 
other applications are constrained by investment (e.g., disk capacity, speech recognition). 

3 Oscillations 

Where have we been and where are we going? As mentioned above, three types of 
answers will be discussed here: consistent progress over time, oscillations and disruptive 
discontinuities. 

It would be great if the field always made consistent progress, but unfortunately, that isn’t 
always the case. It has been claimed that recent progress in speech and language was made 
possible because of the revival of empiricism. I would like to believe that this is correct, given 
how much energy I put into the revival [5], but I remain unconvinced. 

The revival of empiricism in the 1990s was made possible by the availability
of massive amounts of data. Empiricism took a pragmatic focus. What can we do with all 
this data? It is better to do something simple than nothing at all. Engineers, especially in 
America, became convinced that quantity is more important than quality (balance). The use 
of empirical methods and the focus on evaluation started in speech and moved from there to 
language. 

The massive availability of data was a popular argument even before the web. According
to [8], Mercer’s famous comment, “There is no data like more data,” was made at Arden 
House in 1985. Banko and Brill [1] argue that more data is more important than better 
algorithms. 

Of course, the revival of empiricism was a revival of something that came before
it. Empiricism was at its peak in the 1950s, dominating a broad set of fields
ranging from psychology (Behaviorism) to electrical engineering (Information Theory).
Psychologists created word frequency norms, and noted that there were interesting
correlations between word frequencies and reaction times on a variety of tasks. There were
also discussions of word associations and priming. Subjects react quicker and more
accurately to a word like “doctor” if it is primed with a highly associated word like
“nurse.” The linguistics literature talked about a similar concept called collocation
(http://mwe.stanford.edu/collocations.html). “Strong” and “powerful”
are nearly synonymous, but there are contexts where one word fits better than the other,
such as “strong tea” and “powerful drugs.” At the time, it was common practice to
classify words not only on the basis of their meanings but also on the basis of their
co-occurrence with other words (Harris’ distributional hypothesis). Firth summarized this
tradition in 1957 with the memorable line: “You shall know a word by the company it keeps”
(http://www.wordspy.com/WAW/Firth-J.R..asp).

Between the 1950s and the 1990s, rationalism was at its peak. Regrettably, interest in
empiricism faded in the late 1950s and early 1960s with a number of significant events
including Chomsky’s criticism of n-grams in Syntactic Structures [4] and Minsky and
Papert’s criticism of neural networks in Perceptrons [10]. The empirical methodology was
considered too burdensome in the 1970s. Data-intensive methods were beyond the means
of all but the wealthiest industrial labs such as IBM and AT&T. That changed in the 1990s
when data became more available, thanks to data collection efforts such as the LDC
(http://ldc.upenn.edu/). And later, the web would change everything.

Table 2. Oscillations between Empiricism and Rationalism

It is widely assumed that empirical methods are here to stay, but I remain unconvinced. 
Periodic signals, of course, support extrapolation/prediction. The oscillation between
empiricism and rationalism appears to have a forty-year cycle, with the next revival of rationalism
due in another decade or so. The claim that recent progress was made possible by the revival 
of empiricism seems suspect if one accepts that the next revival of rationalism is just around 
the corner. 

What is the mechanism behind this 40-year cycle? I suspect that there is a lot of truth to 
Sam Levenson’s famous quotation: “The reason grandchildren and grandparents get along so 
well is that they have a common enemy.” Students will naturally rebel against their teachers. 
Just as Chomsky and Minsky rebelled against their teachers, and those of us involved in the 
revival of empirical methods rebelled against our teachers, so too, it is just a matter of time 
before the next generation rebels against us. 

I was invited to TMI-2002 as the token empiricist to debate the token rationalist on what 
(if anything) had happened to the statistical machine translation methods over the last decade. 
My answer was that too much had happened. I worry that the pendulum has swung so far
that we are no longer training students for the possibility that the pendulum might swing the 
other way. We ought to be preparing students with a broad education including Statistics and 
Machine Learning as well as Linguistic theory. 



4 Disruptive Discontinuities 

Where have we been and where are we going? There are three logical possibilities.
We are either moving in a consistent direction, or we’re moving around in
circles, or we’re headed off a cliff... Those three possibilities pretty much cover all the
bases.

A possible disruptive discontinuity around the corner is the availability of massive 
amounts of storage. As Moore’s Law continues to crank along, petabytes are coming. A 
petabyte sells for $2,000,000 today, but this price will fall to $2000 in a decade. Can demand 
keep up? If not, revenues will collapse and there will be an industry meltdown. There are two 
answers to this question: either it isn’t a problem, or it is a big problem. 
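
A fall from $2,000,000 to $2000 in a decade is a factor of 1000, or about 2^10, so the price halves roughly once a year. The sketch below works this out and interpolates intermediate prices. It assumes a smooth exponential decline between the two quoted endpoints, which is an idealization on my part.

```python
import math

PRICE_TODAY = 2_000_000.0     # dollars per petabyte, as quoted above
PRICE_IN_A_DECADE = 2_000.0   # dollars per petabyte, as quoted above


def price_after(years: float) -> float:
    """Interpolated price, assuming a smooth exponential decline (an idealization)."""
    return PRICE_TODAY * (PRICE_IN_A_DECADE / PRICE_TODAY) ** (years / 10.0)


def halving_time_years() -> float:
    """Years for the price to halve under the same assumption."""
    return 10.0 * math.log(2.0) / math.log(PRICE_TODAY / PRICE_IN_A_DECADE)


if __name__ == "__main__":
    print(f"halving time: {halving_time_years():.2f} years")
    for year in (0, 5, 10):
        print(f"year {year:2d}: ${price_after(year):,.0f} per petabyte")
```

For revenues to hold steady under this assumption, the number of petabytes sold would have to double about once a year.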

$2000 petabytes might not be a problem because Moore’s Law has been creating more 
and more supply for a long time, and demand has always kept up. The pundits have never 
been able to explain why, but if you build it, they will come. Thomas J. Watson is alleged to 
have grossly underestimated the computing market in 1943: “I think there is a world market 
for maybe five computers” (http://en.wikipedia.org/wiki/Thomas_J._Watson).

On the other hand, $2000 petabytes might be a big problem. Demand is everything. 
Anyone, even a dot-com, can build a telephone network, but the challenge has been to sell 
minutes. The telephone companies need a killer app to put more minutes on the network. 
So too, the suppliers of $2000 petabytes need a killer app to help keep demand in sync with 
supply. Priorities for speech and language processing will change; old apps like data entry 
(dictation) will be replaced with new priorities like data consumption (search). 



4.1 How Much Is a Petabyte? 

The easy answer is: 10^15 bytes. But the executives need a sound bite that works for a lay
audience. How much is a petabyte? Why are we all going to buy lots of them?

A wrong answer is that 10^6 is a million, 10^9 is a billion, 10^12 is a trillion and 10^15 is a zillion,
an unimaginably large number that we used to use synonymously with infinity.

How much disk space does one need in a lifetime? 10^15 bytes per century is about 18
megabytes per minute. Text cannot create demand for a petabyte per capita per lifetime. That 
is, 18 megabytes per minute is about 18,000 pages per minute. Speech also won't create 
enough demand, but it is closer. A petabyte per century is about 317 telephone channels 
for 100 years per capita. It is hard to imagine how we could all process 317 simultaneous 
telephone conversations forever, while we are awake and while we are sleeping. A DVD video 
of a lifetime is about a petabyte per 100 years (1.8 gigabytes/hour = 1.6 petabytes/century), 
but there is too much opportunity to compress video. In addition, there have been many 
attempts to sell Picture Phone in the past, with few successes (though that might be changing). 
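
The per-lifetime figures above can be checked with a few lines of arithmetic. In the sketch below, the 365.25-day year, the binary megabyte (2^20 bytes) and the 8 kb/s telephone channel (1000 bytes per second) are my assumptions; they were chosen because they reproduce the quoted figures and are not stated in the text.

```python
PETABYTE = 10**15                        # bytes
HOURS_PER_CENTURY = 100 * 365.25 * 24    # assumes 365.25-day years
MINUTES_PER_CENTURY = HOURS_PER_CENTURY * 60
SECONDS_PER_CENTURY = MINUTES_PER_CENTURY * 60

BYTES_PER_MEGABYTE = 2**20               # assumption: binary megabytes
CHANNEL_BYTES_PER_SECOND = 1_000         # assumption: 8 kb/s per telephone channel
DVD_BYTES_PER_HOUR = 1.8e9               # "1.8 gigabytes/hour"


def megabytes_per_minute() -> float:
    """A petabyte per century, expressed in megabytes per minute."""
    return PETABYTE / MINUTES_PER_CENTURY / BYTES_PER_MEGABYTE


def telephone_channels() -> float:
    """Number of continuous telephone channels that fill a petabyte in a century."""
    return PETABYTE / SECONDS_PER_CENTURY / CHANNEL_BYTES_PER_SECOND


def dvd_petabytes_per_century() -> float:
    """Storage needed for a century of continuous DVD-quality video."""
    return DVD_BYTES_PER_HOUR * HOURS_PER_CENTURY / PETABYTE


if __name__ == "__main__":
    print(f"{megabytes_per_minute():.1f} megabytes per minute")
    print(f"{telephone_channels():.0f} telephone channels")
    print(f"{dvd_petabytes_per_century():.1f} petabytes per century of DVD video")
```

With decimal megabytes the first figure would come out nearer 19 than 18, so the exact value depends on the unit convention.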



Table 3. Bell & Gray’s Digital Immortality: Lifetime Storage Requirements [2] 



Data type
2 MB
60 MB
5.0 TB
100 TB
DVD video (4.3 Mb/s = 1.8 GB/hour)   20 GB   1 PB

The future of the technology industry depends on supply running into a physical limit,
which is unlikely. Moore’s Law might break down, but I doubt it. A more likely scenario 
is that demand might keep up. If we build it, they will come. The pundits like Bell & Gray 
might be underestimating demand by a lot. Again, I am not optimistic here; these pundits 
are pretty good, but demand has always kept up in the past. The best chance that I see for 
demand keeping up is for the speech and language field to make big progress on searching 
speech and video. The new priorities for speech and language should be to find killer apps 
that will consume disk space. 

Data collection efforts have tended to focus on public repositories such as the LDC and 
the web. There are far greater opportunities to consume space with private repositories, which 
are much larger (in aggregate). The AT&T data network handles a PB/day, and the AT&T 
voice network handles the equivalent of 10 Google collections per day. Local networks are 
even larger. 

The cost of storing a telephone call ($0.005/min) is small compared to the cost of 
transport ($0.05/min). If I am willing to pay for a telephone call, I might as well store it 
forever. Similar comments hold for web pages where the cost of transport also dominates the 
cost of storage. There is no point flushing a web cache if there is any chance that I might 
reference that web page again. 

Private repositories would be much larger if it were more convenient to capture private 
data, and there was obvious value in doing so. Currently, the tools that I have for searching 
the web are better than the tools that I have for searching my voice mail and my email and 
other files on my local network. Better search tools would help keep demand up with supply. 



5 Conclusions 

Where have we been and where are we going? 

In the 1970s, there was a hot debate between knowledge-based and data-intensive 
methods. People think about what they can afford to think about. Data was expensive; only 
the richest industrial labs could afford to play. The data-intensive methods were beyond the 
reach of most universities. Victor Zue dreamed of having an hour of speech online (with 
annotations) in the 1970s. 

In the 1990s, there was a revival of empirical methods. “There is no data like more data!” 
Everyone could afford to play, thanks to data collection efforts such as the LDC, and later, 
the web. Evaluation was taken more seriously. The field began to demonstrate consistent 
progress over time, with strong encouragement from Charles Wayne. The pendulum swung
far (perhaps too far) toward data-intensive methods, which became the method of choice. Is
this progress, or is the pendulum about to swing back the other way? 

In the 2010s, petabytes will be everywhere. (Be careful what you ask for.) This could be 
a big problem if demand can’t keep up with supply and prices collapse. On the other hand, it 
might not be a problem at all. Demand has always kept up in the past, even though the pundits
have never been able to explain why. If you build it, new killer apps will come. Priorities will 
change. Dictation (data entry) and compression will be replaced with applications like search 
(data consumption). But even if everyone stored everything I can possibly think they might 
want to store, I still don’t see how demand can keep up with supply.

We present a new approach to determining the meaning of words in text, which relies
on assigning senses to the contexts within which words occur, rather than to the words 
themselves. A preliminary version of this approach is presented in Pustejovsky, Hanks and 
Rumshisky (2004, COLING). We argue that word senses are not directly encoded in the
lexicon of a language, but rather that each word is associated with one or more stereotypical 
syntagmatic patterns. Each pattern is associated with a meaning, which can be expressed in 
a formal way as a resource for any of a variety of computational applications. 

A crucial element in this approach is that it relies on corpus pattern analysis (CPA) to 
determine the normal contexts in which a word is used. Obviously, it would be impossible 
to determine all possible contexts of word use. An important finding in corpus linguistics 
over the past 15 years has been that, although words have an infinite (or virtually infinite) 
number of possible combinations with other words, the number of normal combinations is 
remarkably small and computationally manageable. Over the last half century, much effort 
has been devoted to analysing possible combinations (syntactic structures), in pursuit of 
the goal of determining all and only the well-formed sentences of a language. This effort, 
though laudable and often ingenious, has had the effect of allowing speculation about rare 
and unusual possibilities in syntax to swamp the great simplicities on which language in use 
depends: the normal, ordinary, typical patterns of word use. Dictionaries, too, have created a 
false impression of language complexity, in that they give equal prominence to rare, unusual, 
and merely possible senses of words, while neglecting to indicate the relative frequencies of 
the various senses. Very often, it turns out that sense 1, or senses 1 and 2 combined, account 
for 80% or 90% of all uses of a word. Special routines are of course needed to deal with the 
less common uses of words, but the current situation is that a collection of subroutines has
been allowed to dominate or even stand in place of the core program that drives language in
use. An approach that focuses on normal use cannot in itself eliminate ambiguity, but it can 
go a very long way to reducing lexical entropy. 
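
The notion of lexical entropy can be illustrated with Shannon entropy over a word's sense distribution. The following is a minimal sketch: the two distributions are invented for illustration (five senses, one of them dominated by sense 1) and are not drawn from any corpus or from our CPA data.

```python
import math


def entropy_bits(probabilities: list[float]) -> float:
    """Shannon entropy, in bits, of a distribution over word senses."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical sense distributions for a word with five dictionary senses.
UNIFORM_SENSES = [0.2, 0.2, 0.2, 0.2, 0.2]       # dictionary view: equal prominence
SKEWED_SENSES = [0.85, 0.05, 0.05, 0.03, 0.02]   # usage view: sense 1 dominates

if __name__ == "__main__":
    print(f"uniform: {entropy_bits(UNIFORM_SENSES):.2f} bits")
    print(f"skewed:  {entropy_bits(SKEWED_SENSES):.2f} bits")
```

When one sense accounts for most uses, the entropy falls from over two bits to under one bit, which is the sense in which attention to normal use reduces uncertainty without eliminating it.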

In the first part of our talk, we survey the current state of the art in selectional preference 
acquisition and in word sense disambiguation. Typically, selectional preference acquisition 
works on the basis of primary data (machine-readable text corpora or the web) but does not 
discriminate between different senses, so that, for example, sun bed, sun blind, sun cream, sun 
lounge, and sun terrace are interspersed as collocates of ‘sun’ indiscriminately with Sun Life
Assurance and Sun Microsystems. Approaches which do attempt word sense discrimination, 
on the other hand, rely on tools that were not specifically designed for the purpose - 
overwhelmingly, WordNet and machine-readable versions of dictionaries that were designed 
for human users. Characteristically, such resources present multiple senses of words, with 
many fine sense distinctions, but without offering any procedure for distinguishing one sense 
from another. 

We go on to discuss the differences between the CPA approach on the one hand and 
FrameNet on the other. CPA is grounded mainly in the systemic-functional approach to 
linguistics of Halliday and Sinclair, but also owes much to Fillmore’s frame semantics. In 
frame semantics, the relationship between semantics and lexicosyntactic realization is often 
at a comparatively deep level, i.e. in many sentences there are elements that are subliminally 
present, but not explicitly expressed. For example, in the sentence “he risked his life”, two 
semantic roles are expressed explicitly (the risker, “he”, and the valued object, “his life”, that 
is put at risk). But at least three other roles are subliminally present, although not expressed: 
the possible bad outcome (“he risked death”), the beneficiary or goal (“he risked his life for 
her; he risked his life for a few dollars”), and the means (“he risked a backward glance”). 

CPA, on the other hand, is shallower and more practical: the objective is to identify, in 
relation to a given target word, the overt textual clues that activate one or more components 
of a word’s meaning potential. There is also a methodological difference: whereas FrameNet 
research proceeds frame by frame, CPA proceeds word by word, taking sample concordances 
from a corpus and analysing the sample exhaustively. 

CPA explicitly makes semantically motivated distinctions between syntagmatic patterns, 
that is, it addresses the problem of word sense disambiguation by asking what differences in 
sense are associated with differences in local context. By contrast, FrameNet researchers are 
required to think up all possible words in a Frame a priori. This means that important senses 
of a word that has been partially analysed are missing, and may remain missing for years to 
come. For example, at the time of writing the verb ‘toast’ is shown as part of the Apply_Heat 
frame, but not the Celebrate frame. It is not even clear whether there is (or is going to be) a 
Celebrate frame. No attempt is made in FrameNet to identify the senses, or normal uses, of 
each word systematically and contrastively. In its present form, FrameNet has as many gaps 
as senses, and it is not clear how or whether the gaps are going to be filled. In CPA, once a 
verb has been analysed, all its main senses are represented (and associated with patterns of 
usage), so that it can be used straight away for sense discrimination and other purposes. 

Our presentation then moves on to give details of CPA methodology. Normal uses 
of words are contrasted with exploitations. In CPA, the distinction between conventional 
metaphors and dynamic metaphors is important. Conventional metaphors are no more than 
another kind of normal use, but dynamic, ad-hoc metaphors exploit norms according to rules 
that can be described. So first we describe criteria for identifying normal uses and associating 
them with literal meanings, then we describe secondary normal uses such as conventional 
metaphors and idioms, then we explore the rules governing the exploitation of these norms. 
One set of exploitation rules are those governing coercion, as described in Pustejovsky’s 
Generative Lexicon theory. Thus, in “he ate the carpet” carpet is coerced by the verb into 
being an honorary, ad-hoc member of the set of foodstuffs. Another kind of exploitation 
involves ellipsis, such that the apparently incoherent (but really uttered) sentence “I hazarded 
various Stuartesque destinations. . . ” can be interpreted as an ellipsis of “I hazarded a guess 
at various Stuartesque destinations. . . ”, relying on the fact that hazard a guess is the most 
normal use of this verb in both British English (47% of all uses) and American English 
(80%). 

Next, we look at lexical sets, and describe how lexical sets can be populated from a 
corpus. Hazard a guess is undoubtedly the most normal use of this verb, but in the British 
National Corpus we also find hazard a speculation, hazard a conjecture, hazard a suggestion, 
hazard an opinion, hazard an observation. Furthermore, in British English at least, this verb 
is found as a reporting verb governing both direct speech and that-clauses. How are all these 
uses to be grouped together in such a way that the resultant lexicon entry activates just the 
right sense of the verb, in contrast to other senses, such as hazarding one's life for a principle, 
where it is a synonym of risk? 

The relationship between semantic types, semantic roles, and lexical sets requires 
detailed consideration. How do we know that, in the clause “. . . where the baby was treated”, 
the baby is almost certainly a medical patient? Two clues in this clause greatly reduce the 
lexical entropy: the adverbial of location (where), and the absence of an adverbial of manner. 
The location is probably a hospital. The pattern underlying this clause contrasts with that 
underlying “she treated me like a servant” and “I believe everybody should be treated with 
respect”. 

We also identify systematic lexical alternation, so that for example the set [[Human = 
Doctor]] regularly alternates with [[Stuff = Medicine]], and the set [[Human = Patient]] 
regularly alternates with [[Condition = Illness | Injury]]. The items before the equals sign 
are semantic types and can be explicitly recognised in text, while the items after the equals 
sign may not be made explicit, i.e. a patient may simply be identified as “the baby”. Once the 
patterns have been teased out of the corpus, they are stored in a computational lexicon and 
made available for text processing. 

Abstract. I will first sketch some background on the company ScanSoft. Next, I 
will discuss ScanSoft’s products and technologies, which include digital imaging 
and OCR technology, automatic speech recognition technology (ASR), text-to-speech 
technology (TTS), dialogue technology, including multimodal dialogues, dictation 
technology and audiomining technology. I will sketch the basic functionality of these 
technologies, a global sketch of the components they are composed of, demonstrate 
some of them, and illustrate the platform types on which they can be used. 

Finally I will sketch what is needed to develop such technologies, focusing not only 
on data but also on required modules and methodologies. 

Part II 

Text 



“Text: a book or other written or printed work, regarded in terms of its 
content rather than its physical form: a text which explores pain and grief” 
NODE (New Oxford Dictionary of English), Oxford, OUP, 1998, page 1998, meaning 1. 

Abstract. This paper describes an algorithm which represents one of the few 
linguistics-based systems for word-to-word alignment. Most systems are purely 
statistical and rely on hypotheses about the structure of texts which often turn out 
not to hold. Our approach combines statistical methods with positional and linguistic ones 
so that it can be applied successfully to any kind of bitext, whatever the internal 
structure of the texts. The linguistic part uses shallow parsing by regular 
expressions and relies on very general linguistic principles. However, a component 
of language-specific methods can be developed to improve results. Our word-alignment 
system was evaluated on a Romanian-English bitext. 



1 Introduction 

Most systems treating the word alignment of bitexts are based on purely statistical methods. 
Therefore, underlying assumptions had to be made in order to fit statistics to natural-language 
data. Some of them assume that the large majority of alignments are 1 : 1, that sentence 
boundaries coincide in the two languages of the bitext and word order is preserved inside 
sentences, or that the texts contain few omissions or additions. As has been pointed out 
many times in the literature, these assumptions do not hold for all translation fragments in 
texts (especially those belonging to novels or newspapers), nor for any two languages. This 
paper aims to show that, without abandoning statistical methods, linguistics can help to 
overcome the limits imposed by the statistically useful but too restrictive hypotheses. The work 
this paper relies on consists in building a word-to-word alignment system (validated on a 
Romanian-English bitext) that, contrary to mainstream approaches, gives linguistics the 
main role in improving alignment results. The linguistic level of our approach is general, 
simple and restricted to using regular expressions for shallow syntactic analyses [1]. The 
paper is structured as follows. The first section graphically presents the shortcomings of some 
statistical assumptions about the structure of the texts to be aligned. The main section describes 
a word alignment system that uses language-based and positional methods suited to any 
kind of text structure. Sections on the evaluation of the system and conclusions end the 
paper. 

2 A Hint About the Drawbacks of Statistical Hypotheses 

Dan Melamed’s approach [2] is a typical statistical model, where most of the mentioned 
assumptions are present. For instance, he assumes that the words of a bitext can be displayed 
inside a rectangle, on either side of its diagonal, which runs from the lower left corner, 
representing the two texts’ beginnings, to the upper right one, representing the texts’ ends. 
In addition, bitext maps are supposed to be injective (1-to-1) partial functions in bitext 
spaces. Consequently, the typical pattern of points of correspondence in a bitext (whatever its 
length) to which Melamed’s method applies looks like that in Fig. 1 (cf. [2]). 



Fig. 1. Typical pattern of points of correspondence in a bitext 



However, if the correspondence points of a bitext are those in Fig. 2, it is hard to see 
how the model could give good results for this kind of bitext. Note that Fig. 2 represents 
the alignment map of a sentence 1 in the gold standard provided for the shared task of the 
HLT/NAACL 2003 workshop on “Building and Using Parallel Texts: Data Driven Machine 
Translation and Beyond” [3]. As one can see, in this bitext there are m : n mappings and 
translation omissions, and word order is not preserved. It is worth mentioning that 
in the whole gold standard omissions represent 13.34%, m : n mappings 43.01% and 1 : 1 
correspondences only 42.65%. 

Our approach uses, at the first step, a statistical method relying on the 1 : 1 
correspondence assumption to obtain alignment anchors, but it is not restricted to that. 
At the following steps, it is assumed that, in a translation unit, word and phrase order can 
be different in the two languages, that there can be omissions and m : n mappings and that 
texts obey linguistic rules only. Therefore our system tries to combine the power of statistics in 
capturing general facts with the flexibility offered by linguistics, in the following way. 



Fig. 2. Sample of true bitext map 

1 The sentence in question is #71: EN: Could it be that the police and the prosecutors adopted that 
attitude as they grew fond of Treptow? RO: I-o fi apucat dragul de Treptow de au adoptat o asemenea 
atitudine? 



3 A Positional Word-Alignment System 

The word alignment system takes as input an extracted lexicon and a parallel corpus that is 
tokenized, lemmatized, morpho-syntactically annotated and sentence aligned. Units of 
sentence alignment are called translation units 2 . Note that all these pre-processing steps 
applied to the texts of the parallel corpus are statistics-based and that the tagging process 
paves the way for further linguistic treatments. Word alignment is performed sequentially, 
one translation unit at a time. Each word is identified by its position in the sentence, 
separately for each half of the bitext. The output is a list of position correspondences. For 
instance, for the bitext in Example 1, the system should produce the list of assignments given 
below (where ‘-1’ marks an untranslated word): 

Example 1. Bitext mapping. 

RO: 0>ajunge 1>! 

EN: 0>that 1>be 2>enough 3>! 

word alignment: (0 2), (1 3), (-1 0), (-1 1) 
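
The output format of Example 1 can be sketched as a list of position pairs in which -1 stands for the missing side. This is only an illustration of the data structure, not the authors' code; the helper `null_links` and the constant `NULL` are hypothetical names introduced here.

```python
NULL = -1  # marks a word with no counterpart in the other language

def null_links(links, n_source, n_target):
    """Complete a list of (source, target) links with null alignments
    for every position that no link covers."""
    linked_src = {s for s, _ in links}
    linked_tgt = {t for _, t in links}
    nulls = [(s, NULL) for s in range(n_source) if s not in linked_src]
    nulls += [(NULL, t) for t in range(n_target) if t not in linked_tgt]
    return links + nulls

ro = ["ajunge", "!"]
en = ["that", "be", "enough", "!"]

# The two real links of Example 1; the English words 'that' and 'be'
# remain untranslated and therefore receive null alignments.
alignment = null_links([(0, 2), (1, 3)], len(ro), len(en))
print(alignment)
```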

Our alignment approach is made up of three main stages. The first one is already very 
common and consists of a rough ambiguous alignment based on an extracted translation 
lexicon. The second solves ambiguities and suspicious alignments by using positional 
criteria. Finally, the third stage resorts to linguistic methods in order to align words about 
which the translation lexicon says nothing. All these phases are presented below. 

3.1 The Rough Alignment 

The rough alignment in our system is based on the output of the translation equivalents 
extractor TREQ [4]. This process is applied to a training parallel corpus including the bitext 
to be aligned. Our experiments showed that extracting a lexicon from a training corpus 
leads to better results than using an external dictionary, which is by now an unsurprising 
finding. 

The extracting algorithm relies on two underlying assumptions: 

1. a lexical token is translated by only one token [5]; 

2. words in a translation equivalence have the same part-of-speech. 

2 Translation unit example: <tu id="Ozz.1"><seg lang="en"><s id="Oen.1"> 
<w lemma="that" ana="2+,Di">that</w><w lemma="be" ana="1+,Vm">'s</w> 
<w lemma="enough" ana="14+,R">enough</w><c>!</c></s></seg> 
<seg lang="ro"><s id="Oro.1"><w lemma="ajunge" ana="1+,Vmnp">ajunge</w> 
<c>!</c></s></seg></tu>. 

These assumptions are indeed restrictive, but they do not prevent additional processing 
units from recovering some of the missed or incomplete translations, as we shall see later. 

In order to capture cross part-of-speech translations we have created metacategories 
covering those parts-of-speech liable to overlap. For instance, adjectives, nouns and 
verbs, in general the most frequently crossed parts-of-speech, are tagged with the same 
metacategory (treated itself as a single part-of-speech), so that they can be extracted under 
the second assumption above. In this way, a translation equivalence can put an adjective in 
correspondence with a verb, a verb with a noun and so on. Even if using metacategories lowers 
the overall performance of the extractor, the result is good enough for our word aligner, which 
is quite able to filter out the extractor’s errors. 

Besides the translation lexicon, the rough alignment also makes use of cognate detection. 
For each pair of alignment candidates in a translation unit that is not in the lexicon, the LCS 
score is computed [6]. The threshold is experimentally set to 0.65 for the Romanian-English 
bitext. 

Punctuation marks of the same type are automatically considered cognates and, of course, 
no score is calculated for them. 
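
A cognate test of this kind can be sketched as follows. The text does not spell out how the LCS score is normalized; the sketch assumes the common longest-common-subsequence ratio, i.e. the LCS length divided by the length of the longer word. That normalization, the lower-casing, and the function names are assumptions of this illustration; only the 0.65 threshold comes from the text.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b
    (standard dynamic programming, keeping one row at a time)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_ratio(a, b):
    """LCS length normalized by the longer word (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

THRESHOLD = 0.65  # value reported for the Romanian-English bitext

def are_cognates(a, b, threshold=THRESHOLD):
    return lcs_ratio(a.lower(), b.lower()) >= threshold

for pair in [("societate", "society"), ("promovat", "promote"), ("aer", "atmosphere")]:
    print(pair, round(lcs_ratio(*pair), 3), are_cognates(*pair))
```

Under this assumed normalization, societate/society and promovat/promote pass the threshold while aer/atmosphere does not, which agrees with the starred pairs in Example 2.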

In conclusion, the anchor points in our system are the extracted translation equivalents 
and cognates, including punctuation marks among which open-closed marks (parentheses, 
brackets, etc.) play a special role. 

Example 2 (where the parentheses indicate parts-of-speech 3 and the stars mark cognates) 
illustrates a bitext and the proper ambiguous mapping achieved at this stage. 

Example 2. Rough alignment of a bitext. 

RO: 0>basca (r) 1>aer (n) 2>de (s) 3>balacareala (n) 4>ordinar (a) 5>promovat (v) 
6>in (s) 7>societate (n) 8>. (b) 

EN: 0>not (q) 1>to (q) 2>mention (v) 3>the (d) 4>atmosphere (n) 5>of (s) 6>vulgar (a) 
7>scandal (n) 8>promote (v) 9>in (s) 10>the (d) 11>society (n) 12>. (b) 

0 basca /mention-2 
1 aer /atmosphere-4 
2 de /to-1 /of-5 /in-9 
5 promovat /promote-8* 
6 in /to-1 /of-5 /in-9 
7 societate /society-11* 
8 . /.-12* 
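
The lexicon-driven part of the listing above can be reproduced with a simple lookup. In this sketch the toy lexicon contains only the pairs visible in Example 2 (it is not TREQ output), cognates and punctuation are left out because they are handled separately, and `rough_alignment` is a hypothetical function name.

```python
def rough_alignment(source, target, lexicon):
    """Map each source position to every target position whose lemma
    is listed as a translation equivalent of the source lemma."""
    candidates = {}
    for i, src_lemma in enumerate(source):
        matches = [j for j, tgt_lemma in enumerate(target)
                   if (src_lemma, tgt_lemma) in lexicon]
        if matches:
            candidates[i] = matches
    return candidates

ro = ["basca", "aer", "de", "balacareala", "ordinar", "promovat", "in", "societate", "."]
en = ["not", "to", "mention", "the", "atmosphere", "of", "vulgar", "scandal",
      "promote", "in", "the", "society", "."]

# Toy lexicon: only the equivalences that appear in Example 2.
lexicon = {
    ("basca", "mention"), ("aer", "atmosphere"),
    ("de", "to"), ("de", "of"), ("de", "in"),
    ("in", "to"), ("in", "of"), ("in", "in"),
}

candidates = rough_alignment(ro, en, lexicon)
print(candidates)
```

Positions 2 and 6 come out with three candidates each, which is the ambiguity the positional search has to resolve.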



As one can see, there are words to which no translation equivalences are assigned (for 
example Romanian positions 3 and 4). This is either because the extractor fails to find a 
translation equivalence for them or they are not translated at all. There are also words that 
have associated ambiguous lists (e.g. Romanian positions 2 and 6). Therefore the next task is 
to solve the following three main problems: 

1. dictionary ambiguities, especially concerning functional words; 

2. dictionary errors; 

3. gaps, that is, words escaping primary alignment, due either to a missing translation 
equivalence or to translation omissions in one language or the other. 

3 Parts-of-speech notations used in this paper: a=adjective, b=punctuation mark, c=conjunction, 
d=determiner, m=numeral, n=noun, p=pronoun, q=particle, r=adverb, s=preposition, t=article, 
v=verb, y=abbreviation 



3.2 The Positional Searching 

Searching for alignments is done in two scanning steps. Both are applied to source words 
that have an associated translation equivalence list. 

The first step aims at disambiguating the rough alignment. It goes through the words that have 
been assigned at least one translation equivalent, from the beginning to the end of the sentence. 
Choosing an alignment candidate from an ambiguity list mainly relies on a positional 
criterion. Usually the winner is the target word that is closest (or identical) to the previously 
linked target word. However, (long) gaps and text dislocations can disturb an alignment 
based on the previous word, and therefore the position relative to the current source 
word is also taken into account. The non-ambiguous translation equivalents are automatically 
assigned. This can lead to multiple associations for a single target word. 

At this level, if consecutive words in the source part are associated with words that are 
consecutive or close to each other in the target part, these are taken as forming an alignment 
chain, considered as a reliable alignment. Cognates reinforce alignments. 

In short, the selection criterion for this step is the minimal value of a function 
f(cog, d_a, d_r), where cog is the cognate status, d_a is the positional distance to the 
previous assignment, and d_r is the relative distance to the source position. 
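
The first scanning step can be sketched as below. The paper does not give the form of f, so the linear combination used here, its weights (`w_r`, `cog_bonus`), and the use of plain index differences for both distances are assumptions made purely for illustration; `first_pass` is a hypothetical name.

```python
def f(cog, d_a, d_r, w_r=0.5, cog_bonus=1.0):
    """Assumed cost: distance to the previous assignment, plus a
    down-weighted distance to the source position, minus a cognate bonus."""
    return d_a + w_r * d_r - (cog_bonus if cog else 0.0)

def first_pass(candidates):
    """candidates: {source_pos: [(target_pos, is_cognate), ...]}.
    Scan left to right and keep, for each source word, the candidate
    with the minimal cost."""
    links = {}
    prev_target = None
    for s in sorted(candidates):
        options = candidates[s]
        if len(options) == 1:
            t = options[0][0]  # unambiguous: assigned automatically
        else:
            def cost(option):
                t_pos, cog = option
                d_a = abs(t_pos - prev_target) if prev_target is not None else 0
                d_r = abs(t_pos - s)
                return f(cog, d_a, d_r)
            t = min(options, key=cost)[0]
        links[s] = t
        prev_target = t
    return links

# Candidate lists of Example 2 (True marks a cognate).
example = {
    0: [(2, False)], 1: [(4, False)],
    2: [(1, False), (5, False), (9, False)],
    5: [(8, True)],
    6: [(1, False), (5, False), (9, False)],
    7: [(11, True)], 8: [(12, True)],
}
links = first_pass(example)
print(links)
```

With these assumed weights the Romanian prepositions at positions 2 and 6 are linked to 'of' (5) and 'in' (9); other weightings could of course decide differently, which is why the second, backward step is needed.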

The target of the second step is to correct errors and to refine the previous rough 
alignment. It eliminates multiple associations and deletes suspicious links. The scanning 
direction, this time, is from the end of the sentence to its beginning. Now the algorithm 
takes into account more information than at the first step, such as the distance to the back 
assignment, the distances to the forward two assignments, the distance between source 
positions and the alignment chains. The result is a strict one-to-one word mapping, which 
can reflect modifications or even deletions of the links in the previous step, if no translation 
equivalent satisfies the alignment criterion. Note that this criterion affects both the ambiguous 
and non-ambiguous positions. 

At this step, the main problem the algorithm has to cope with is the ambiguity of 
functional words. They are very frequent and very ambiguous and therefore they could 
easily mislead the position-based aligning. That is why, at this point, the system appeals to a 
general linguistic assumption, namely that sentences in any language have internal structure 
(even those with free word order). In other words, sentences consist of syntactical phrases 
with great cohesion between their elements. For instance, determiners gravitate round their 
nominal head, prepositions precede noun phrases, particles and auxiliaries stay close to verbs, 
and conjunctions precede conjuncts. 

The regular expressions in Example 3 capture this general linguistic assumption about 
the cohesion of elements. 

Example 3. Regular expressions illustrating syntactical cohesion. 

1. [dp]*[ar]*[nvm] 

2. [cq] n* [arv] 
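
One way to apply such expressions is to run them over a sentence written as a string of one-letter part-of-speech tags. The sketch below does this with Python's `re` module; treating the white space in the printed expressions as insignificant is an assumption, and `cohesion_spans` is a hypothetical helper.

```python
import re

# The two expressions of Example 3, with white space removed (assumption).
COHESION_PATTERNS = [
    re.compile(r"[dp]*[ar]*[nvm]"),
    re.compile(r"[cq]n*[arv]"),
]

def cohesion_spans(tags):
    """Return (pattern_index, start, end) for every non-overlapping match
    of each pattern in a string of one-letter part-of-speech tags."""
    spans = []
    for index, pattern in enumerate(COHESION_PATTERNS):
        for match in pattern.finditer(tags):
            spans.append((index, match.start(), match.end()))
    return spans

# Tags of the English half of Example 2:
# not to mention the atmosphere of vulgar scandal promote in the society .
en_tags = "qqvdnsanvsdnb"
spans = cohesion_spans(en_tags)
print(spans)
```

For instance, the span (0, 3, 5) groups the determiner 'the' with its head 'atmosphere', and (1, 1, 3) groups the particle 'to' with the verb 'mention'.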

At the first reading of the text such sequences are memorized (in each language) in order 
to detect relations between functional and main words. Relying on that, we have set a 
precedence constraint requiring that a functional word be linked in the bitext map according 
to the element it precedes. So, for example, if two nouns are linked then the closest 
prepositions or determiners preceding them are also linked. For instance, in Example 2 
above, the nouns societate (RO-7) and society (EN-11) are linked. That triggers the linking 
of the prepositions in (RO-6) and in (EN-9) as well, because they are the closest to already 
assigned nouns. Thus, the ambiguity is eliminated. 
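
The precedence constraint can be illustrated on the case just described. This is a simplified sketch: it looks only at positions, takes the nearest preceding ambiguous functional word on the source side and the nearest preceding candidate on the target side, and ignores the memorized tag sequences. The function names are hypothetical.

```python
def closest_preceding(position, candidates):
    """Largest candidate position strictly before `position`, or None."""
    before = [c for c in candidates if c < position]
    return max(before) if before else None

def apply_precedence(head_links, ambiguous):
    """head_links: {source_head: target_head} for already linked main words.
    ambiguous: {source_function_word: [target candidate positions]}.
    Link the functional word preceding each source head with the candidate
    that most closely precedes the corresponding target head."""
    resolved = {}
    for src_head, tgt_head in head_links.items():
        src_fw = closest_preceding(src_head, ambiguous.keys())
        if src_fw is None:
            continue
        tgt_fw = closest_preceding(tgt_head, ambiguous[src_fw])
        if tgt_fw is not None:
            resolved[src_fw] = tgt_fw
    return resolved

# Example 2: societate (RO-7) is linked to society (EN-11); the prepositions
# at RO-2 and RO-6 both have the candidates to-1, of-5 and in-9.
resolved = apply_precedence({7: 11}, {2: [1, 5, 9], 6: [1, 5, 9]})
print(resolved)
```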

To give an idea of the effectiveness of this constraint, our experiments show that 
it raises the precision of the system by 2.33% and the recall by 0.45%. 



3.3 The Treatment of Gaps 

Once the translation equivalents are aligned, the system moves on to the treatment of gaps, 
that is, of words left unlinked in either of the two languages. 

To achieve this, alignment segments are delimited in every text. These are 
pieces of text that begin with a conjunction, a preposition or a punctuation mark and end 
with the token preceding the next conjunction, preposition, punctuation mark or end of sentence. 
This roughly simulates a chunker for prepositional and noun phrases and exploits the 
fact that conjunctions (either coordinating or subordinating ones) always precede conjuncts. 
These segments, in their turn, are aligned depending on the previous mapping (based on the 
translation equivalents). The result can generate 1 : 1 or m : n mappings of segments. The 
aligned segments are then inspected in order to align unlinked words. First, words of the 
same part-of-speech are mapped. After that, the rest of the words are submitted to different 
linguistic heuristics, general or language-specific. 
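
The delimitation of alignment segments can be sketched over a string of one-letter tags. The sketch assumes that any sentence-initial material preceding the first boundary token forms a segment of its own, a case the description does not cover explicitly; `segments` and `BOUNDARY_TAGS` are hypothetical names.

```python
BOUNDARY_TAGS = set("csb")  # conjunction, preposition, punctuation mark

def segments(tags):
    """Split a tag string into (start, end) spans; a new span opens at
    every conjunction, preposition or punctuation mark."""
    spans = []
    start = 0
    for i, tag in enumerate(tags):
        if tag in BOUNDARY_TAGS and i > start:
            spans.append((start, i))
            start = i
    if start < len(tags):
        spans.append((start, len(tags)))
    return spans

ro_tags = "rnsnavsnb"      # Romanian half of Example 2
en_tags = "qqvdnsanvsdnb"  # English half of Example 2

ro_segments = segments(ro_tags)
en_segments = segments(en_tags)
print(ro_segments)
print(en_segments)
```

On Example 2 both halves fall into four segments, so that, for instance, 'de balacareala ordinar promovat' can be set against 'of vulgar scandal promote' before the unlinked words inside them are examined.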

A general method is based again on the word cohesion inside syntactic phrases and 
consists in aligning consecutive words in pairs. For instance, given the sequence adjective 
noun (seen as a pair) in one language and the pair/sequence noun noun in the other, if 
two nouns are linked, then the other elements are going to be aligned too. Such a group 
cohesion turns out to apply for prepositions, articles, nouns and adjectives, but also adverbs, 
prepositions and conjunctions, or verbs, particles and conjunctions. Example 4 illustrates 
some regular expressions controlling such pairs. 

Example 4. Regular expressions for paired assignments 

1. [nay] [nay] 

2. [brsc] [brsc] 

3. [qvc] [qvc] 
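
The pairing heuristic can be sketched as follows: when a linked word has an unlinked neighbour on each side of the bitext, and both neighbouring pairs match one of the expressions of Example 4, the two remaining words are linked. The sketch simplifies matters by accepting the first admissible neighbour it finds and by ignoring segment boundaries; the function names are hypothetical.

```python
import re

# The three expressions of Example 4, applied to two-tag strings.
PAIR_PATTERNS = [re.compile(p) for p in (r"[nay][nay]", r"[brsc][brsc]", r"[qvc][qvc]")]

def is_pair(first_tag, second_tag):
    return any(p.fullmatch(first_tag + second_tag) for p in PAIR_PATTERNS)

def pair_links(src_tags, tgt_tags, links):
    """Extend `links` ({source_pos: target_pos}) with the unlinked
    neighbours of linked words when both sides form a permitted pair."""
    new_links = dict(links)
    linked_targets = set(links.values())
    for s, t in links.items():
        for ds in (-1, 1):
            for dt in (-1, 1):
                s2, t2 = s + ds, t + dt
                if not (0 <= s2 < len(src_tags) and 0 <= t2 < len(tgt_tags)):
                    continue
                if s2 in new_links or t2 in linked_targets:
                    continue
                src_ok = is_pair(src_tags[min(s, s2)], src_tags[max(s, s2)])
                tgt_ok = is_pair(tgt_tags[min(t, t2)], tgt_tags[max(t, t2)])
                if src_ok and tgt_ok:
                    new_links[s2] = t2
                    linked_targets.add(t2)
    return new_links

# Example 2, supposing balacareala (RO-3) is already linked to scandal (EN-7):
# the noun-adjective pair RO-3/4 faces the adjective-noun pair EN-6/7.
extended = pair_links("rnsnavsnb", "qqvdnsanvsdnb", {3: 7})
print(extended)
```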

We have also applied some language-specific rules concerning Romanian versus English 
syntactic particularities or cross-linguistic differences in part-of-speech mapping. Such 
cross-linguistic differences concern, for instance, mood/tense verbal particles or auxiliaries 
that differ in the two languages. An example of a syntactic particularity is that, usually, English 
phrases of two nouns, noun1 noun2, e.g. chocolate candy, are translated into Romanian as 
noun2 + the preposition ‘de’ + noun1: bomboana de ciocolata. On the other hand, Romanian has 
a lot of articles and particles mapping into English determiners and prepositions, respectively. 
The language-specific module is a distinct unit in our system, in order to keep the generality 
of the algorithm, and this module can be adapted for pairs of languages other than Romanian 
and English. 



4 Evaluation and Further Work 

Our system participated in the shared task organized within the HLT/NAACL 2003 workshop 
[3]. At that time our system obtained good results in terms of precision (P), recall (R) 
and f-measure (F), namely 81.29%, 60.26% and 69.21%, respectively, while the best results 
belonged to the XRCE.Nolem system [7] and amounted to P=82.65%, R=62.44%, F=71.14%, 
for non-null alignments (cf. [3]). Since then we have continuously improved the algorithm 
and, as one can see, its performance is now better than that of the other participating 
systems at that time, namely P=85.56%, R=65.68%, F=74.31% for non-null alignments, 
and P=67.15%, R=67.44%, F=67.29% for alignments including translation omissions. These 
results are detailed in Table 1 (where GS stands for gold standard and POS for part-of-speech). 
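
The reported F values agree with the usual balanced f-measure, the harmonic mean of precision and recall. The short check below assumes that this is the definition used (the text does not state it) and recomputes F from the P and R figures quoted above; the dictionary labels are just descriptive names.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F) in percent, as quoted in the text.
reported = {
    "our system, shared task": (81.29, 60.26, 69.21),
    "XRCE.Nolem, shared task": (82.65, 62.44, 71.14),
    "our system now, non-null": (85.56, 65.68, 74.31),
    "our system now, with nulls": (67.15, 67.44, 67.29),
}

for name, (p, r, f_reported) in reported.items():
    print(f"{name}: computed F = {f_measure(p, r):.2f}, reported F = {f_reported}")
```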



Table 1. Alignments general evaluation. 

Alignments    GS # of links    Our system Prec. [%]    Our system Recall [%] 
Total         7149             67.15                   67.44 
Null           954             31.03                   78.82 
Same POS      3959             88.32                   81.43 
Cross-POS     2236             77.26                   49.55 
Non-null      6195             85.56                   65.68 



Further work remains to be done. It is worth mentioning that the null alignments produced by 
our system represent 33.74% of the total, while those in the gold standard represent only 
13.34%, because our approach assigns a null alignment to any word that fails to be aligned. 
This actually reflects the low power of the algorithm in detecting multi-word expressions. 
However, we are confident that the linguistic module for treating gaps, including multi-word 
expressions, can be developed further, again by using shallow syntactic analyses and positional 
search, and we intend to pursue our research in this direction. 

5 Conclusions 

The approach sketched here presents a word-to-word alignment algorithm that does not 
impose any structural restriction on the texts to be aligned. Thus translation dislocations and 
omissions are no longer a problem. 

Given a lexicon (previously) extracted from a training parallel corpus, the algorithm can 
be applied to bitexts however short they are, by using positional and lexical information 
at local level. The lexical information, in turn, consists of morpho-lexical tags 
and some very general syntactic patterns implemented by regular expressions. The simple 
linguistic methods used turn out to be very efficient. 

The algorithm paves the way for detecting multi-word expressions, which is currently 
another challenging issue in computational linguistics. 

Abstract. Translation of multi-word expressions (MWEs) is one of the most challenging 
tasks of a machine translation (MT) system. In this paper, we present an innovative 
technique for dealing with MWEs in the context of MT. The technique permits 
bilinguals to give translations of MWEs in the form of patterns, without requiring them 
to be trained linguistically. The interpretation of the patterns is done by a dynamic 
machine learning algorithm, which allows the main rule-based MT system to operate 
based on linguistic rules. Thus, the bilingual patterns (without any explicit linguistic 
input) are used in conjunction with the main linguistic system. This is made possible 
by the learning pathway templates. These templates need to be specially prepared by 
trained linguists only once. After that they help to process a potentially large number 
of patterns. 

The implemented system is being used with a large-scale rule-based MT system to 
improve its performance. This framework can also be extended to help example-based 
or statistical MT systems to deal with MWEs. 



1 Introduction 

This paper addresses the problem of developing a technique for handling MWEs in the 
context of a rule-based MT system. 

MWEs are expressions with a special meaning, which cannot be derived from their 
component words. MWEs include, among others, idioms (‘kick the bucket’ instead of ‘die’), 
phrasal verbs (‘carry on’ instead of ‘continue’), and compounds (‘judicial enquiry’). A typical 
natural language system assumes each word to be a lexical unit, but this assumption does not 
hold in the case of MWEs. They have idiosyncratic interpretations that cross word boundaries. 
Thus, identification and generation of MWEs has been a major concern for scholars working 
in this area, and MWEs are therefore considered a ‘pain in the neck’ (Sag et al., 2002). 

Even though several of these MWEs are not semantically compositional, they behave 
like any other phrase syntactically, i.e., they take inflections, modifiers etc., and undergo 
syntactic operations such as passivization. Translating such MWEs is therefore 
all the more complex since, after identification, they need to be processed 
linguistically. Their corresponding target language equivalents also need to be generated. 
Hence, a large dictionary is required to improve translation performance, but it often 
becomes a bottleneck, as building dictionaries is not an easy task. It requires an immense 
amount of time and effort. It is not always possible either to generate this data automatically 
or to have language experts develop it. 

On the other hand, ordinary bilinguals are a rich resource of such data. They can 
provide parallel expressions in the two languages concerned. If one can tap this resource, 
a sufficiently large amount of data can be prepared in a shorter time. The data thus 
created would nevertheless lack the linguistic knowledge necessary for analysis. It cannot, 
therefore, be used directly for processing in a rule-based MT system. However, if a mechanism 
can be developed to interpret the data linguistically and then use it in conjunction with the 
rest of the linguistic processing, this can be an effective approach for handling MWEs in an 
MT system. 

The present paper is an attempt towards evolving a mechanism whereby the MWEs are 
incorporated in the main linguistic processing system through learning pathway templates. 
The idiom dictionary itself is developed by bilinguals who need not have any linguistic 
training. A simple notation is developed with which non-linguists can specify patterns for 
MWEs. These patterns are connected to the main linguistic system using Learning Pathway 
Templates. The concept of templates is interesting and can be generalized to other kinds of 
learning templates. 

This system is implemented and is being used with a large scale English to Hindi MT 
system. 

2 General Problem 

MWEs are special constructions that require appropriate representation and analysis. Several 
of them show a lot of variation (Segond & Tapanainen, 1996). However, there is much in 
MWEs that is mechanizable and computable. 

An example of an MWE of this type is, 

“simmer with anger” (1) 

The above MWE can undergo the following types of variations. 



2.1 Morphological Variation 

‘Simmer’ can be inflected for tense, aspect and modality, e.g., simmers, simmering, 
simmered. Similarly, nouns in the MWEs can occur in varying forms depending on the 
gender, number etc. 

2.2 Insertions 

New words (qualifiers) can be inserted within the MWEs, as in 

“(sadly) simmering with (quiet) anger” (2) 

OR 

“has simmered (for quite a while) with (quiet) anger.” (3) 



2.3 Replacement 

In an MWE, one word can replace another without affecting the overall characteristics of that 
MWE. For example, 

“simmer with [fear]” or “simmer with [pain]” have characteristics similar to those of (1). The 
words in [] are words that took the place of the original word. 
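
The three kinds of variation can be illustrated with a single hand-written regular expression for “simmer with anger”. This is only a sketch of what a matcher has to tolerate and is not the pattern notation or the learning pathway templates proposed in this paper. The listed verb forms, the list of replacement nouns, and the limit of four inserted words are all assumptions made for the example.

```python
import re

VERB_FORMS = r"(?:simmer|simmers|simmering|simmered)"  # morphological variation
SLOT_NOUNS = r"(?:anger|fear|pain)"                     # replacement
INSERT = r"(?:\w+\s+){0,4}"                             # up to four inserted words

MWE = re.compile(
    rf"\b{VERB_FORMS}\s+{INSERT}with\s+{INSERT}{SLOT_NOUNS}\b",
    re.IGNORECASE,
)

def find_mwe(sentence):
    """Return the matched stretch of text, or None if the MWE is absent."""
    match = MWE.search(sentence)
    return match.group(0) if match else None

for sentence in [
    "She was sadly simmering with quiet anger.",
    "He has simmered for quite a while with quiet anger.",
    "They simmer with fear.",
    "The soup simmers on the stove.",
]:
    print(sentence, "->", find_mwe(sentence))
```

A pattern of this kind has to be written by hand for every expression, which is the bottleneck that motivates collecting patterns from bilinguals instead.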

The MWEs obey and fit inside the linguistic framework of the language (Wehril, 1998), 
i.e., though these MWEs have a non-compositional semantics, structurally they behave like 
any other phrase in a sentence, with various parts of the expression taking their inflections, 
allowing modifications etc. Clearly, processing such expressions is not easy. While handling 
MWEs, we have to combine the specialized processing of MWEs with the usual processing of 
a sentence. The two processes would ensure that the non-compositional part is taken as a unit, 
and processed further using the usual mechanism of producing meaning out of compositional 
parts. 

Identifying such MWEs and providing their meaning can be done by people. As 
mentioned above, we have found that people familiar with the application can provide the 
meaning without necessarily having elaborate linguistic training. For applications such as 
machine translation, the number of MWEs to be handled for good quality translation is 
extremely large. Therefore, it is crucial to be able to use the contributions of a large number
of bilinguals without requiring that they learn linguistics first. A problem with the data thus
created, as mentioned earlier, is how to integrate it into the larger linguistic system,
since it lacks the necessary linguistic information. As illustrated by the examples earlier
(simmer with anger), unless the two are combined, the coverage provided by the MWE data
would be minuscule.

The proposed solution is to couple the data with the linguistic system. This coupling 
can be done using the learning pathway templates. This also fits in with the machine
learning of rules, etc., in case we want the machine to “abstract” and generalize from the
data thus provided. In fact, machine learning becomes possible with a much smaller amount
of data, because the data already fits in with the linguistic framework and the generalizations
can work along well-defined pathways. 



3 An Overview of Related Work 

Before proceeding with this approach, we briefly mention related work.

Segond and Tapanainen’s work on ‘Using a Finite-state based Formalism to Identify and 
Generate Multi-word Expressions’ (Segond & Tapanainen, 1996) demonstrates how a
multi-word expression can be encoded, and how their compiler would use them to identify the
MWEs. In contrast, our formalism is simple, yet expressive. The person providing the MWEs 
can provide the data in a most natural way. The system learns the linguistic characteristics of 
the MWE on its own, using the learning pathway templates. The system then uses the learned 
patterns to identify MWEs in a sentence. 

Wehrli’s work on translating idioms (Wehrli, 1998) talks about how MWEs can be 
used by a linguistic system. It also talks about the transfer and generation of idioms in 
its framework. Our approach to generation is similar to Wehrli’s approach. Our framework
introduces the concept of a compiled pattern, which is used to do the generation robustly.

The MWEs, which can be collected using our framework, can also be induced into the 
example base of a lexical EBMT system (Brown, 1999). Every MWE can be represented as 
a separate equivalence class (token). The translation of this token can be remembered as the 
‘substitution string’ (suggested in this framework).

4 Framework

4.1 Patterns for MWEs 

The setting of our application is a rule-based MT system, which is already available. Its 
coverage might be low, or the output quality poor. A typical reason could be that it does not 
handle MWEs. The outputs produced by the system are looked at by bilinguals (call them 
language editors) who are not trained in linguistics. They correct the translation and in case 
they find that the error is due to an MWE, they also provide the MWE and its translation. They 
are also encouraged to provide simple patterns to cover a class of MWEs. This, however, does 
not require them to provide linguistic analysis. 

Each input provided by the language editors consists of two items: 

1. An example sentence (with an MWE) in the source language and its translation in the
target language.

2. An MWE pattern in the source language and its translation (the MWE in the pattern must
occur in the example sentence).

It is a requirement that each pattern must be lexicalized, which means that there must be 
at least one lexical item associated with the pattern. 

An example pattern (P1) for an English to Hindi MT system looks as follows:

Pattern P1:

1. Example sentence and its translation:                                   (4)

Godhra is simmering with anger
Godhraa krodha  se      bhabhak        rahaa hei
        (anger) (INSTR) (heated_state) (ing)

2. MWE pattern and its translation:                                        (5)

Simmering with anger*1
krodha*1 se      bhabhak        rahaa hei
(anger)  (INSTR) (heated_state) (ing)

*1 = anger, frustration, pain

4.2 Notations

Variables in the pattern are marked by ‘*’. In example (5), *1 says that anger is a variable and
can be replaced by any of the words anger, frustration or pain. If a list of words is not given,
then it can be replaced by any other word of the same category, which is noun in example (5). *1
also associates anger with krodha in the target language.

In this pattern, all the inflections of a word are allowed. Thus, the example pattern (4)
would allow simmering, simmers, simmered, etc. In the case of verbs, auxiliary verbs can also be
added, for example ‘has been simmering’. (As we will see shortly, it is the linguistic analysis
that makes this possible.) If the user wants to disallow other forms of a word, the user can put
‘!’ after it to indicate this.

For example, in

Regret to tell! => bataate hue     dukha hai ki
                   (while telling) (sad) (is that)

the ‘!’ symbol says that the form of tell is fixed and that it cannot occur as tells, telling, etc.
No auxiliary verbs are allowed either. Tell itself can be generalized to any word with the same
lexical category; in other words, any verb, such as say or eat, can occur instead of tell, but the
form must remain the same.



Operators    Root      Category   Other features
*            dropped   kept       dropped
!            kept      kept       kept
Not(*,!)     kept      kept       dropped
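
The notation and the operator table can be sketched in code as follows. This is only an illustration under stated assumptions: the names `PatternToken`, `RETAINED` and `parse_pattern` are hypothetical, the variable marker is assumed to be written attached to the word (as in `anger*1`), and the actual input format used by the system is not specified in the paper.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PatternToken:
    word: str                  # word as typed by the language editor
    operator: str              # "*" (variable), "!" (fixed form) or "" (unmarked)
    var_index: Optional[int]   # number after "*", e.g. 1 for "anger*1"


# Information retained for each operator, following the operator table.
RETAINED = {
    "*": {"root": False, "category": True, "other": False},
    "!": {"root": True, "category": True, "other": True},
    "": {"root": True, "category": True, "other": False},
}

TOKEN_RE = re.compile(r"^(?P<word>[^*!]+)(?:(?P<star>\*)(?P<idx>\d+)|(?P<bang>!))?$")


def parse_pattern(text: str) -> List[PatternToken]:
    """Split a pattern on whitespace and read the optional '*n' or '!' marker."""
    tokens = []
    for raw in text.split():
        m = TOKEN_RE.match(raw)
        if m is None:
            raise ValueError(f"cannot parse pattern token: {raw!r}")
        if m.group("star"):
            tokens.append(PatternToken(m.group("word"), "*", int(m.group("idx"))))
        elif m.group("bang"):
            tokens.append(PatternToken(m.group("word"), "!", None))
        else:
            tokens.append(PatternToken(m.group("word"), "", None))
    return tokens


for token in parse_pattern("simmering with anger*1"):
    print(token.word, repr(token.operator), RETAINED[token.operator])
```

Keeping the retention rules in a single table mirrors the paper's point that editors only mark words, while the system decides which linguistic information to keep.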



4.3 Learning Pathway Templates 

Learning pathway templates connect the patterns to their linguistic analysis. These
meta-patterns are specified by experienced linguists working on the MT system. Once specified,
they are used by the MT system to linguistically interpret the patterns given by the language 
editors, and to use the patterns appropriately while translating. 

The templates are small in number and each covers a set of patterns, which have the 
same linguistic analysis. They specify the head of the MWE, associations from the source 
language pattern to the target language pattern and also give the agreement between different 
components of a pattern. 

For example, here is a learning pathway template:

Template T1:                                                               (6)

VG& PP => PP VG&
{tgt_vibh=‘INSTR’}

T1 states that some of the patterns given by the language editors consist of two
components: VG (for verb group) followed by PP, where VG is the head of the pattern
(marked by ‘&’). It further says that in the translation, the order of the two components
is reversed (shown after ‘=>’). It also specifies the value of a feature called tgt_vibh (namely,
the case ending in the target language output).
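
A template such as T1 can be held in a small data structure. The sketch below is hypothetical: the `Template` class, the `parse_template` helper and the convention of passing features as a dictionary are assumptions made for illustration, not the system's actual representation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Template:
    source: List[str]   # chunk tags on the source side, e.g. ["VG", "PP"]
    target: List[str]   # chunk tags on the target side, e.g. ["PP", "VG"]
    head: str           # chunk tag marked with "&"
    features: Dict[str, str] = field(default_factory=dict)


def parse_template(rule: str, features: Optional[Dict[str, str]] = None) -> Template:
    """Parse a rule of the form 'VG& PP => PP VG&'; '&' marks the head chunk."""
    lhs, rhs = (side.split() for side in rule.split("=>"))
    heads = [tag[:-1] for tag in lhs if tag.endswith("&")]
    if len(heads) != 1:
        raise ValueError("expected exactly one head chunk marked with '&'")
    source = [tag.rstrip("&") for tag in lhs]
    target = [tag.rstrip("&") for tag in rhs]
    return Template(source, target, heads[0], dict(features or {}))


T1 = parse_template("VG& PP => PP VG&", {"tgt_vibh": "INSTR"})
print(T1)
```

The template carries three kinds of information named in the text: the head, the source-to-target order, and feature values such as tgt_vibh.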

4.4 Compiling the Patterns Using Templates 

We now illustrate the process by which an MWE pattern may be compiled for future use, 
utilizing the templates. It can be done in two ways: 



Using the user patterns: Part-of-speech taggers and chunkers are used to assign POS tags
to words and to group them into chunks in the example sentences (given along with a pattern)
in the source and target languages. The tag and chunk information from the sentences is induced
into the pattern. For a word not marked by ‘*’ or ‘!’ in a given pattern, only the root and
lexical category are kept; other features such as gender, number, person and tense, aspect,
modality are dropped.

For words marked by ‘*’, the root is also dropped and only the lexical category is kept. For
words marked by ‘!’, the other features are retained irrespective of the root. For example, the
pattern P1 looks as follows after the induction process:

((simmering)) [[with N*1]] =>
      VG           PP

[[N*1 se]] ((bhabhaka))
    PP         VG
          (heated state)

Next, the patterns with chunks are matched with templates and the specified heads and 
features are transferred from the matched template to the pattern. 

After this processing using template T1, the processed pattern P1 looks as follows
(P1-T1):

((simmering))& [[with anger*1]] =>

[[krodha*1 se]] ((bhabhaka))&
      PP             VG
{tgt_vibh=‘INSTR’}

*1 = anger, frustration, pain; & marks the head.

Note that the feature tgt_vibh gets transferred from template T1 to the compiled pattern. The
target language pattern is also compiled into the substitution string. For example, we get after
compilation:

*1_se_bhabhaka

where *1 indicates that an element corresponding to the variable is to be substituted,
followed by the case marker ‘se’ and the verbal root bhabhaka. Note that the order of elements
is as given in the RHS of the template, which means that the reordering gets done, as
specified by the pattern, while generating the above compiled string.
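
The compilation step just described can be sketched as follows. This is a simplified illustration, not the system's implementation: the function `compile_pattern`, its argument layout and the assumption that each chunk tag occurs only once in a pattern are all hypothetical.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, List[str]]   # (chunk tag, elements inside the chunk)


def compile_pattern(source: List[Chunk], target: List[Chunk],
                    tmpl_source: List[str], tmpl_target: List[str],
                    head: str, features: Dict[str, str]) -> Dict[str, object]:
    """Match a chunked pattern against a template and build the substitution string."""
    if [tag for tag, _ in source] != tmpl_source:
        raise ValueError("source chunks do not match the template")
    if sorted(tag for tag, _ in target) != sorted(tmpl_target):
        raise ValueError("target chunks do not match the template")
    # Emit the target chunks in the order given on the template's right-hand side.
    by_tag = dict(target)   # assumes each chunk tag occurs once
    ordered = [by_tag[tag] for tag in tmpl_target]
    substitution = "_".join(word for chunk in ordered for word in chunk)
    # Head and features are transferred from the template to the compiled pattern.
    return {"head": head, "features": dict(features), "substitution": substitution}


compiled = compile_pattern(
    source=[("VG", ["simmer"]), ("PP", ["with", "*1"])],
    target=[("PP", ["*1", "se"]), ("VG", ["bhabhaka"])],
    tmpl_source=["VG", "PP"], tmpl_target=["PP", "VG"],
    head="VG", features={"tgt_vibh": "INSTR"},
)
print(compiled["substitution"])   # *1_se_bhabhaka
```

Because the order comes from the template's right-hand side, the reordering is fixed at compile time and need not be repeated for each input sentence.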




Fig. 1. Compiling the pattern

Using the analyzed patterns: This approach is similar to the first approach. The only
difference is that here we receive a dictionary of MWEs that has already been analyzed
by linguists and we proceed from there. This data is readjusted automatically to make it
compatible with the chunks that the chunker forms. It can then be processed in the same way
as in the first approach, i.e., it is matched with the learning pathway templates to get the
compiled pattern and substitution string.

4.5 Processing MWE in an Actual Sentence 

The MT system first performs chunking and morphological analysis of the input sentence.
This identifies the root, part-of-speech tag and other features of each word, besides grouping
the words into chunks. Next, the roots of the words in the input sentence are matched against
the compiled patterns. The process is efficient because the lexical items in the pattern are
used in matching with the given input sentence.

For example, after chunking the given input sentence

“Godhra was simmering with quiet pain”

we get the following matches of roots or words (marked by ‘#’):

[[Godhra#]] ((was simmering#)) [[with quiet pain#]]
    NP             VG                  PP
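
A minimal sketch of this root-based matching is given below. It makes several simplifying assumptions that are not from the paper: matched chunks must be adjacent, a variable is represented as a set of allowed roots, and the names `chunk_matches` and `find_mwe` are hypothetical.

```python
from typing import List, Optional, Set, Tuple, Union

Element = Union[str, Set[str]]           # literal root, or roots allowed for a variable
PatternChunk = Tuple[str, List[Element]]
SentenceChunk = Tuple[str, List[str]]    # (chunk tag, roots of the words in the chunk)


def chunk_matches(pattern: List[Element], roots: List[str]) -> bool:
    """True if the pattern elements occur in order among the roots (insertions allowed)."""
    pos = 0
    for element in pattern:
        allowed = element if isinstance(element, set) else {element}
        while pos < len(roots) and roots[pos] not in allowed:
            pos += 1
        if pos == len(roots):
            return False
        pos += 1
    return True


def find_mwe(pattern: List[PatternChunk], sentence: List[SentenceChunk]) -> Optional[int]:
    """Return the index of the first chunk of a match, or None if there is no match."""
    n = len(pattern)
    for start in range(len(sentence) - n + 1):
        window = sentence[start:start + n]
        if all(p_tag == s_tag and chunk_matches(p_elems, s_roots)
               for (p_tag, p_elems), (s_tag, s_roots) in zip(pattern, window)):
            return start
    return None


pattern = [("VG", ["simmer"]), ("PP", ["with", {"anger", "frustration", "pain"}])]
sentence = [("NP", ["Godhra"]), ("VG", ["be", "simmer"]), ("PP", ["with", "quiet", "pain"])]
print(find_mwe(pattern, sentence))   # 1
```

Inside a chunk, extra words such as the auxiliary or the adjective quiet are skipped, which is how insertions and auxiliaries are tolerated in this sketch.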

The above step generalizes from language elements in patterns to linguistic structures
and is a crucial element in processing. This generalization is made possible with the help
of taggers and chunkers, which are used while processing an input sentence, and the learning
pathway templates, which were used in the compilation of the patterns. Indeed, the name
pathway was chosen because it specifies a path leading from language data to linguistic
theory. The pathway templates were used in compiling the patterns. The same pathway
can also be used if we try to generalize out of large collections of patterns when operating
in a purely learning mode (though this is not discussed here).

After the step discussed above, the processing proceeds in the usual way in the linguistic
MT system. Therefore, adverbs, adjectives, etc. intervening between the matched chunks are
handled without any problems. In other words, the benefits of linguistic processing are
available even though the MWE is being processed in a special way.

Finally, the substitution of the target language expressions is done in a special way. All the
chunks except the matched ones are substituted by the target language expressions in the usual
way. For the matched chunks, no substitution is done for the non-head chunks. The matched
chunk marked as the head (as specified by the pathway template) is substituted by the
compiled target string. Fig. 2 illustrates the substitution, where *1 is obtained by translating
in the usual way by the linguistic system, i.e.,

[[with quiet pain]] ==> shaanta darda

which is then substituted at the appropriate place in the compiled target expression.

Finally, the appropriate inflections are generated, yielding actual word forms. For the
matched chunks, the pathway template may overwrite feature values and specify its own. While
generating the inflections, the system uses the case endings (vibhakti) and other features in
the usual way, with no special treatment for MWEs as opposed to non-MWEs. For example, the
following gets generated:

Godhraa shaanta darda se bhabhaka rahaa hei
(Godhra quiet pain with heated_state-ing)

Here ‘Godhraa’, the subject, agrees in gender, number and person with the verbal form
of ‘bhabhaka’, and the tense-aspect-modality of the verb is obtained from the English sentence
(‘ing’), as would be done for any other verb. Note that the adjective quiet appears at the right
place even though the MWE pattern did not mention anything about adjectives. Similarly, the
auxiliary verbs are produced correctly and at the right place. All this is the result of combining
the linguistic knowledge with the MWE patterns. Fig. 3 summarises this whole process.
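
The special substitution step can be sketched as below. This is an illustration under assumptions: the function `substitute`, the `matched` mapping and the placeholder strings for the usual translations are hypothetical, and the generation of inflections (which yields ‘rahaa hei’) is left out.

```python
from typing import Dict, List


def substitute(chunk_translations: List[str], matched: Dict[int, bool],
               compiled: str, variable_values: Dict[str, str]) -> List[str]:
    """
    chunk_translations: rendering of each source chunk by the usual system.
    matched: index of each matched chunk, mapped to True if it is the head.
    compiled: substitution string such as '*1_se_bhabhaka'.
    variable_values: usual translation of the material bound to each variable.
    """
    out = []
    for i, translation in enumerate(chunk_translations):
        if i not in matched:
            out.append(translation)        # unmatched chunk: usual translation
        elif matched[i]:
            parts = [variable_values.get(p, p) for p in compiled.split("_")]
            out.append(" ".join(parts))    # head chunk: compiled target string
        # matched non-head chunk: nothing is emitted (NIL)
    return out


words = substitute(
    chunk_translations=["Godhraa", "<usual VG translation>", "<usual PP translation>"],
    matched={1: True, 2: False},
    compiled="*1_se_bhabhaka",
    variable_values={"*1": "shaanta darda"},
)
print(" ".join(words))   # Godhraa shaanta darda se bhabhaka
```

The result is still at the level of roots; the usual inflection generator would then add tense, aspect and agreement.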



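[Figure 2: the chunked sentence [[Godhra]] ((was simmering#)) [[with quiet pain#]], tagged
NP VG PP, becomes NP PP VG after reordering, with the substitutions Godhra, NIL and
*1_se_bhabhaka respectively.]

Fig. 2. Processing an Actual Sentence

[Figure 3: flowchart; the recoverable stages are Source Sentence, Chunking and Chunked
Sentence.]

Fig. 3. Summary of processing an Actual Sentence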



The elegance of the solution is that linguistic processing proceeds in the usual way for
all other steps, giving great power to this approach. Learning pathway templates consist of
templates as well as the special steps, which are different from the usual steps. In the example
above, the procedural part of the pathway consists of two special processing steps interspersed
with the usual processing. Thus, a pathway has both declarative and procedural parts,
where the procedural parts consist of the special steps. As mentioned earlier, it connects
language data (or patterns) to linguistic theory and the linguistic processing system.

5 Experiment

As patterns that have already been analyzed manually by linguists are available (SAID
idioms dictionary, LDC), we conducted an experiment to evaluate the system based on that
data. We picked a representative sample of 100 patterns along with their translations. The
analyzed data was processed using the second approach for compiling the pattern. The system
performance was then evaluated.

6 Evaluation 



The system was tested on a set of 230 sentences extracted from the BNC corpus containing
various MWEs, and the result was compared with the output obtained without the specialized
processing of MWEs.

Number of distinct MWEs                                                      100
Number of sentences                                                          230
Number of sentences in which translation of MWEs improved                    139
Number of sentences in which translation of MWEs remained same                12
Number of sentences in which chunking was not compatible with the one
required to process the MWEs                                                  61
Number of sentences in which translation of MWEs was wrong                    18
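
The counts above can be turned into shares with a few lines of arithmetic. The sketch below only uses the figures from the table; the dictionary keys and variable names are hypothetical labels.

```python
# Sentence counts taken from the evaluation table.
results = {
    "improved": 139,
    "unchanged": 12,
    "chunking incompatible": 61,
    "wrong": 18,
}

total = sum(results.values())   # the four outcomes add up to all 230 sentences
shares = {name: round(100 * count / total, 1) for name, count in results.items()}

# Share of improved translations among sentences whose chunking was compatible.
compatible = total - results["chunking incompatible"]
improved_when_chunked = round(100 * results["improved"] / compatible, 1)

for name, share in shares.items():
    print(f"{name}: {share}%")
print(f"improved, given compatible chunking: {improved_when_chunked}%")
```

Roughly 60 percent of all sentences improved, and the share rises to about 82 percent once the sentences with incompatible chunking are set aside, which is consistent with chunking errors being listed as the first major limitation.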



7 Major Limitations 

1. Chunking output has errors. When this happens, the sentence does not match the compiled
pattern and is therefore not processed as an MWE.

For example, in the sentence

“And striker Geoff Ferris is likely to put pen to paper for 12 months”

the idiom “put pen to paper” was analysed by the chunker as

((put)) [[pen]] ((to paper))
  VG&     NP       VINF

instead of

((put)) [[pen]] [[to paper]]
  VG&     NP        PP

2. It has been observed that when ordinary linguists are asked to translate an idiom from one 
language to another (SAID idioms dictionary, LDC), they find it difficult to do it without 
looking at the example sentence containing the MWE. An example sentence helps the 
bilingual to deduce the meaning of an idiom in case of lack of familiarity. Hence, an 
example sentence is a must with every idiom in the dictionary. 

8 Conclusion 

In this paper, we have introduced a dynamic learning system that can take non-linguistic
patterns for dealing with MWEs, interpret them linguistically, and use them in conjunction
with the main linguistic system. This is made possible by the use of statistical taggers and
specially designed learning pathway templates.

The system is carefully crafted so that it can be used by bilinguals to give patterns for
handling MWEs. At the same time, it can be, and is, implemented efficiently.

Abstract. The Szeged Corpus is a manually annotated natural language corpus
comprising 1.2 million word entries plus 225 thousand punctuation marks. With this, it 
is the largest manually processed Hungarian textual database that serves as a reference 
material for further research in natural language processing (NLP) as well as a learning 
database for machine learning algorithms and other software applications. Language 
processing of the corpus texts so far included morpho-syntactic analysis, POS tagging 
and shallow syntactic parsing. Semantic information was also added to a pre-selected 
section of the corpus to support automated information extraction (IE). 



1 Introduction 

The present state of the Szeged Corpus¹ [1] is the result of three national projects and
the cooperation of three consortium partners. The corpus currently comprises approx. 1.2 
million word entries, 145 thousand different word forms, and an additional 225 thousand 
punctuation marks. The first version of the corpus contains texts from five topic areas, roughly 
200 thousand words each, equalling a 1 million word textual database. The second version 
was extended with a sample of texts including another 200 thousand words. Texts have gone 
through different phases of natural language processing (NLP) and analysis. Extensive and 
accurate manual annotation of the texts, incorporating over 124 person-months of manual 
work, is a great merit of the corpus. 

Initially, corpus words were morpho-syntactically analysed with the help of the Humor²
automatic pre-processor and then manually POS tagged by linguistic experts. The Hungarian
version of the internationally acknowledged MSD (Morpho-Syntactic Description)
scheme [2] was used for the encoding of the words. Because the MSD encoding
scheme is extremely detailed (one label can store morphological information on up to
17 positions), there is a large number of ambiguities, i.e. roughly every second word of the
corpus is ambiguous. Disambiguation therefore required accurate and detailed work, amounting
to 64 person-months of manual annotation. Currently all possible labels as well as
the selected ones are stored in the corpus.

¹ The different versions of the Szeged Corpus are available at
http://www.inf.u-szeged.hu/hlt.

² The Humor morpho-syntactic analyser is a product of MorphoLogic Ltd., Budapest.

Petr Sojka, Ivan Kopecek, and Karel Pala (Eds.): TSD 2004, LNAI 3206, pp. 41-47, 2004.

© Springer-Verlag Berlin Heidelberg 2004

A unique feature of the corpus is that, parallel to POS tagging, annotators defined users’
rules for a pre-defined set of ambiguous words. The aim of applying users’ rules was to mark
the relevant context that determines the selection of a certain POS tag. Experience showed that
users’ rules are highly accurate and very specific; they are therefore a valuable source of added
linguistic information and are also suitable to support the more precise training of machine
learning algorithms.

Following that, texts of the Szeged Corpus were shallow parsed, during which annotators 
marked noun phrase structures and clause structures. The linguistic information identified by 
shallow syntactic parsing is rich enough to support a number of large-scale NLP applications 
including information extraction (IE), phrase identification in information retrieval, named 
entity identification, and a variety of text-mining operations. Noun phrase (NP) and clause 
(CP) annotation was conducted manually on Szeged Corpus 2.0 (1.2 million words) on 
automatically pre-parsed sentences. Pre-parsing was completed with the help of the CLaRK 
programme³, in which syntactic rules have been defined by linguistic experts for the
recognition of NPs. Manual validation and correction followed the pre-parsing process and 
lasted 60 person-months. 

Due to the accurate and exhaustive manual annotation, the resulting corpus (both first and 
second versions) could serve as an adequate database for the training and testing of machine 
learning algorithms. Different kinds of POS taggers have been trained and tested on the 
corpus, and results showed that despite the agglutinating nature of the Hungarian language, all
methods can be used effectively. Machine learning algorithms were also applied for learning 
NP recognition rules on the basis of the annotated texts. A pre-selected section of the corpus 
(200 thousand words of short business news) was marked up with simple semantic features as 
well, which served as an experimental database for the training of machine learning methods 
for IE. 

Current work aims at a more detailed syntactic analysis of the Szeged Corpus. With this,
developers intend to lay the foundation of a Hungarian treebank, which is planned to be 
enriched with detailed semantic information as well in the future. 

2 Related Works 

Corpus-based methods play an important role in empirical linguistics as well as in the 
application of machine learning algorithms. Annotated reference corpora, such as the Brown 
Corpus [3], the Penn Treebank [4], the Susanne Corpus [5], and the BNC [6], have 
helped both the development of English computational linguistics tools and English corpus 
linguistics. Typical corpus projects follow the Penn Treebank approach, which distinguishes 
a POS tagging and a syntactic parsing phase each comprising an automatic annotation step 
followed by manual validation and correction. Recently, there have been several efforts 
to build annotated corpora for other languages as well, such as French [7], Italian [8], 
Russian [9], German and Slavic languages. 

The NEGRA [10] POS tagged and syntactically annotated corpus of 355 thousand 
tokens was the first initiative in corpus linguistics for German. The more recent TIGER 
Treebank project [11] aims at building the largest and most extensively annotated treebank 

³ The CLaRK system was developed by Kiril Simov at the Bulgarian Academy of Sciences in the
framework of the BulTreeBank project (http://www.bultreebank.org).



